Project Report

Ashley’s Lost Phone

Authors

Kaylee Mo, Nicket Mauskar, Luca Moretti, and Ashley Witarsa

Published

March 9, 2023

Abstract
The return on academic investment, or the value of a college education relative to its cost, is an increasingly important issue for students, families, and policymakers. We ask which characteristics of colleges, if any, lead to higher monetary returns after four years of university. To accomplish this, we developed a multiple linear regression model based on datasets pertaining to universities and the attributes that define them, in order to (1) establish any outstanding explanatory variables that correlate with the median income associated with these universities, and (2) predict income values based on universities and their data. By creating synthetic test data to evaluate our model and optimizing the resulting RMSE of the predictions, we determined that selectivity (based on Barron’s selectivity scale) and undergraduate population had the highest correlation with median income, while school rank, surprisingly, had the weakest correlation.

1 Background / Motivation

The return on academic investment, or the value of a college education relative to its cost, is an increasingly important issue for students, families, and policymakers. In 2023, the value of a college education is likely to be even more closely scrutinized as students and families seek to make informed decisions about their investments in higher education. In fact, a recent study by Georgetown University found that the median ROI for a bachelor’s degree in the United States is 102%, but this varies widely by major, with some majors having ROIs as high as 300% [1]. While that study focuses only on majors and degrees, we hope to draw similar conclusions regarding the schools themselves.

Thus, as upperclassmen nearing graduation, we are now more concerned than ever about the return on academic investment given the current economic climate. We ask which characteristics of colleges, if any, lead to higher monetary returns after four years of university.

2 Problem statement

Our problem is a regression problem because our response variable (Y) is the mid-career salary of graduated undergraduate students, which is a continuous variable. We are interested in both inference and prediction because we are identifying the relationship between multiple university predictors – Barron’s selectivity scale, student/faculty ratio, student population, and average tuition costs – and the resulting return on education (measured here by mid-career salary) in order to build a tool for the stakeholders mentioned below. For instance, we will attempt to identify whether a university’s average tuition costs impact its students’ mid-career salary.

Because ours is a regression problem, we will assess model accuracy with the root mean squared error (RMSE). We will optimize RMSE because we want to penalize larger errors more heavily. For example, certain schools that are not as selective may have an exceptionally high mid-career salary in the dataset because of outlier high-performing individuals who skew the average – we want our model to be more sensitive to these outliers.
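The reason RMSE is more sensitive to outliers than a metric like mean absolute error can be seen in a small sketch (all salary numbers here are made up for illustration):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every dollar of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical salaries: four small misses and one $40k outlier miss.
actual    = [100_000, 105_000, 98_000, 110_000, 150_000]
predicted = [101_000, 104_000, 99_000, 109_000, 110_000]

print(mae(actual, predicted))   # the outlier contributes linearly
print(rmse(actual, predicted))  # the outlier dominates the squared average
```

The single large miss roughly doubles RMSE relative to MAE, which is exactly the sensitivity we want.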

3 Data sources

  1. https://www.kaggle.com/datasets/kabhishm/top-american-colleges-2022 This dataset contains information about US colleges, including their student/faculty ratio, rank, undergraduate population, and mid-career salary. This dataset will help us understand how different variables affect average mid-career salary expectations.
  2. https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay?select=tuition_cost.csv This dataset contains information about tuition and total costs of US colleges. It distinguishes between in-state and out-of-state tuition for public colleges. This dataset will help us understand how tuition affects average mid-career salary expectations.
  3. https://www.jkcf.org/wp-content/uploads/2018/06/The_Transfer_Process-2015_list_of_selective_colleges.pdf This dataset contains information on the selectivity of US colleges. It uses Barron’s Profile of American Colleges to put schools into three categories: “competitive”, “less competitive”, and “non-competitive”. This dataset will help us understand how selectivity affects average mid-career salary expectations.

4 Stakeholders

Parents with children graduating from high school – they want to ensure their kids are receiving an education that is worth the investment: their hope is that the better the education, the higher the lifetime earnings their children will gross. The information from this project will help parents make informed decisions.

Students – they can benefit from this project by using the information to determine which universities they want to apply to and/or what universities are worth attending.

University administration and board members – especially those working at the most selective schools – are interested in having a high mid-career salary association with higher tuition costs and selectivity, and lower student/faculty ratio and student population to generate revenue.

Investors – like administration and board members, investors are interested in these statistics to determine whether or not their investment will have a high return. The federal and state governments also want to see this kind of information to inform their annual budgets and how much money to allocate to U.S. universities.

5 Data quality check / cleaning / preparation

To start, we had to combine our 3 datasets. We left joined the Barron’s Selectivity dataset and the Tuition dataset onto the Top Colleges list because it had the most colleges. We also had to import fuzzymatcher to help with the left joins because the names of the colleges did not match exactly across the datasets (for example: UC Berkeley vs. University of California, Berkeley). Fuzzymatcher takes the 2 college names and, based on their similarity, creates a score. It then applies a threshold, and if the score is above the threshold, the 2 college names are matched and their rows are merged. Above is the distribution of the variables in the dataset.
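The score-and-threshold idea behind fuzzymatcher can be illustrated with a minimal sketch (using Python’s difflib in place of fuzzymatcher’s internal probabilistic scorer; the names and the threshold value here are illustrative, not the library’s defaults):

```python
from difflib import SequenceMatcher

def match_score(a, b):
    """Similarity score in [0, 1] between two college names (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.6  # illustrative cutoff, not fuzzymatcher's actual default

left_names  = ["University of California, Berkeley", "Northwestern University"]
right_names = ["UC Berkeley", "Northwestern Univ."]

# For each left name, keep the best-scoring right name if it clears the threshold.
matches = {}
for ln in left_names:
    best = max(right_names, key=lambda rn: match_score(ln, rn))
    if match_score(ln, best) >= THRESHOLD:
        matches[ln] = best

print(matches)
```

In practice `fuzzymatcher.fuzzy_left_join` wraps this kind of scoring into a pandas left join keyed on the name columns.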

The dataset initially had more columns than those in the summary statistics table above. We dropped any columns relating to the location of the university (state, longitude, latitude) and any columns that we did not deem important predictors of the median base salary of undergraduate universities (room and board cost, degree length). The main variables we focused on in our analysis were medianBaseSalary, rank, studentFacultyRatio, undergradPop, out_of_state_tuition, Selectivity, and totalGrantAid. Selectivity was the only categorical variable, with values of 1 (most selective), 2 (selective), and 3 (least selective).

Lastly, to prep the data for model development, we checked for null values in the variables we were interested in. As you can see below, no variable was missing more than 7 values, which we found to be relatively small compared to the total number of observations. We dropped all rows with null values and then started model development.

6 Exploratory data analysis

We first looked at the correlations of the variables with our response, medianBaseSalary, since we wanted to include the variables most strongly correlated with it. For instance, rank (-0.645), studentFacultyRatio (-0.443), and out_of_state_tuition (0.579) were variables we wanted to include because they had the highest correlations. We also wanted to include totalGrantAid and undergradPop because we were still interested in how these variables impacted medianBaseSalary.
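This correlation screen can be sketched as follows (the column names follow the report, but the data here are randomly generated stand-ins, so the numbers will not match the ones above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the merged college dataset (column names follow the report).
df = pd.DataFrame({
    "rank": rng.uniform(1, 500, n),
    "studentFacultyRatio": rng.uniform(3, 25, n),
    "out_of_state_tuition": rng.uniform(10_000, 60_000, n),
    "undergradPop": rng.uniform(500, 40_000, n),
})
df["medianBaseSalary"] = (
    140_000 - 80 * df["rank"] + 0.5 * df["out_of_state_tuition"]
    + rng.normal(0, 10_000, n)
)

# Rank the predictors by absolute correlation with the response.
corrs = df.corr()["medianBaseSalary"].drop("medianBaseSalary")
print(corrs.abs().sort_values(ascending=False))
```

In this toy setup rank dominates by construction; on the real data the same one-liner produced the correlations quoted above.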

We also created a pairplot to look at the distribution of each predictor against medianBaseSalary. This was important for when we considered variable transformations while developing the model. As we can see, a few of the variables do not have a linear relationship with medianBaseSalary.

7 Approach

We used a multiple linear regression model and prioritized optimizing RMSE and R². RMSE tells us the average difference between the values predicted by our model and the actual values; we wanted to make sure that our model was neither overfitting nor underfitting. R² tells us how much of the variance in the median base salary is explained by the predictor variables we selected, a measure of goodness of fit. It allows us to make better inferences from the training data.

Given that our data was solely focused on colleges and their median income, it posed a challenge to divide the dataset into training and test sets without sacrificing valuable information required to build the model. Additionally, as the dataset did not specify the time frame within which the median income was calculated for each college, predicting exact median incomes could have been inconsistent. To overcome this obstacle, one thing that we did differently was create a test dataset with 500 artificial “students” who attended 500 colleges selected at random from the original dataset. We merged the attributes and predictors of these colleges, including median income, with the test dataset so that this information would be available when running the model on it. Lastly, to ensure that the students would not earn exactly the median income of their respective universities, we introduced a roughly $20,000-wide band of noise around the median incomes. That way, for example, if student N went to Northwestern University with a median income of S, in the test data they would have an income anywhere in [S - 10,000, S + 10,000].
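The synthetic-student construction can be sketched as follows (column names follow the report; the college data here are randomly generated stand-ins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy stand-in for the cleaned college dataset (names follow the report).
colleges = pd.DataFrame({
    "name": [f"College {i}" for i in range(100)],
    "medianBaseSalary": rng.uniform(60_000, 160_000, 100),
})

# 500 artificial "students", each assigned a college sampled with replacement.
students = colleges.sample(n=500, replace=True, random_state=0).reset_index(drop=True)

# Add a ~$20,000-wide band of uniform noise: income lands in [S - 10,000, S + 10,000].
noise = rng.uniform(-10_000, 10_000, len(students))
students["income"] = students["medianBaseSalary"] + noise

print(students[["name", "medianBaseSalary", "income"]].head())
```

Sampling with replacement lets popular colleges appear for multiple students, and the noise keeps the model from trivially memorizing each college’s median income.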

Our first model, with no interactions, worked in the sense that it was functional; however, the R² value was only 0.59. This is obviously not ideal, considering the number of variables we had at our disposal and the potential interactions that we hadn’t yet explored. Thus, we used this first model as a basis for our exploration of model interactions (explained more in the next section).

Our problem did not already have solutions posted on Kaggle. The dataset did not have much information outside of the data itself, nor was there any initial EDA provided.

8 Developing the model

To start, we created the model with the 6 variables we identified in the EDA. That model had an R² value of 0.596.

Next, we wanted to address multicollinearity. We did a VIF analysis to determine if any of the variables were collinear with each other.

As you can see, none of the variables had a VIF value above 5. A VIF value above 5 typically means a variable has high collinearity, and because none of the variables exceeded that threshold, we did not need to remove any of them from our initial model.
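A minimal numpy sketch of the VIF computation we ran (statsmodels’ variance_inflation_factor implements the same formula; the data here are synthetic, chosen so one pair of columns is nearly collinear):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing
    column j on the remaining columns (with an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)             # independent of x1 -> VIF near 1
x3 = x1 + 0.3 * rng.normal(size=300)  # nearly collinear with x1 -> high VIF
X = np.column_stack([x1, x2, x3])
print(np.round(vif(X), 2))
```

The collinear pair (x1, x3) blows past the VIF-of-5 rule of thumb while the independent column stays near 1, which is the pattern we checked for in our predictors.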

We then ran best subset selection to determine whether all 6 variables in the initial model were needed. We chose best subset selection because an initial model of only 6 variables would not take a long time to run. Our total elapsed time for the best subset selection was about 1.02 seconds.
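Best subset selection over 6 candidates is cheap because there are only 2^6 - 1 = 63 subsets to fit. A sketch of the exhaustive search (variable names follow the report, but the data are synthetic, with the stand-in for totalGrantAid deliberately contributing nothing):

```python
from itertools import combinations

import numpy as np

def fit_r2(X, y):
    """OLS R² for design matrix X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(2)
n = 300
names = ["rank", "sfRatio", "tuition", "undergradPop", "selectivity", "grantAid"]
X = rng.normal(size=(n, 6))
# grantAid (last column) contributes nothing to the response, mirroring our data.
y = 2*X[:, 0] - X[:, 1] + X[:, 2] + 0.5*X[:, 3] - X[:, 4] + rng.normal(size=n)

# Exhaustive search: best subset of each size k, judged by R².
best = {}
for k in range(1, 7):
    best[k] = max(combinations(range(6), k),
                  key=lambda s: fit_r2(X[:, list(s)], y))

print({k: [names[i] for i in s] for k, s in best.items()})
```

As in our analysis, the best 5-variable subset excludes the uninformative grantAid column; with interactions the subset count explodes, which is why the 4-hour runs below were infeasible.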

As you can see above, the best subset selection indicated that only 5 of the 6 original variables were needed, telling us to drop the variable ‘totalGrantAid’. The model with only 5 variables had the same R² (0.596) as the original model. This makes sense because totalGrantAid in the original model had a p-value of 0.556 and was the only variable with a p-value above 0.05. Thus, totalGrantAid was an insignificant predictor of medianBaseSalary and was dropped.

We then wanted to look into adding interactions of the 5 variables in the model. To do so, we first wanted to try best subset selection with interactions. However, that resulted in runtimes over 4 hours and our computers would not run it. Thus, we looked at correlations between the variables and logically added different interactions to the model.

For instance, out_of_state_tuition and studentFacultyRatio seemed to be highly correlated, so we added that interaction to the model. However, the interaction did not increase R² at all, and the interaction term had a very high p-value. We tried this with other interaction terms, such as studentFacultyRatio and undergradPop, and no interaction seemed to have an impact on the model. Thus, we ended up not adding any variable interactions to the model.
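The check we ran for each candidate interaction can be sketched as follows (synthetic data with no true interaction effect, so the product term adds essentially nothing, mirroring what we saw):

```python
import numpy as np

def r2(A, y):
    """OLS R² for a design matrix that already includes its intercept column."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(3)
n = 300
tuition = rng.normal(size=n)
sf_ratio = rng.normal(size=n)
# Response generated with NO interaction between the two predictors.
y = 1.5 * tuition - 1.0 * sf_ratio + rng.normal(size=n)

base = np.column_stack([np.ones(n), tuition, sf_ratio])
with_int = np.column_stack([base, tuition * sf_ratio])  # add the product term

print(round(r2(base, y), 4), round(r2(with_int, y), 4))
```

R² can never decrease when a term is added, so the decision rule is whether it increases *meaningfully* and whether the term’s p-value is small; here (as in our model) it does neither.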

Next, we wanted to look at variable transformations. Looking at the pairplot from the EDA, we knew we had to do a few variable transformations.

Taking a closer look at the 5 remaining variables, we decided to transform studentFacultyRatio and undergradPop. As you can tell, studentFacultyRatio seems to have a quadratic relationship with medianBaseSalary, and undergradPop seems to have a logarithmic relationship with medianBaseSalary. We implemented these transformations, and the R² increased to 0.605.
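The effect of these two transformations can be sketched with synthetic data generated from a quadratic studentFacultyRatio effect and a logarithmic undergradPop effect, so the transformed fit should beat the purely linear one:

```python
import numpy as np

def r2(A, y):
    """OLS R² for a design matrix that already includes its intercept column."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

rng = np.random.default_rng(4)
n = 300
sf_ratio = rng.uniform(3, 25, n)
undergrad = rng.uniform(500, 40_000, n)

# True relationship: quadratic in studentFacultyRatio, logarithmic in undergradPop.
y = -50 * (sf_ratio - 14) ** 2 + 8_000 * np.log(undergrad) + rng.normal(0, 2_000, n)

linear      = np.column_stack([np.ones(n), sf_ratio, undergrad])
transformed = np.column_stack([np.ones(n), sf_ratio, sf_ratio**2, np.log(undergrad)])

print(round(r2(linear, y), 3), round(r2(transformed, y), 3))
```

The coefficients and shapes here are illustrative assumptions; the pairplot is what told us which transformation each real variable needed.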

Next, we wanted to check model assumptions.

We plotted the fitted values against the residuals. The model seems to satisfy the linearity assumption, as we do not observe a strong pattern in the residuals around the line Residuals = 0. The residuals are distributed more or less evenly on both sides of the blue line for all fitted values.

The model also seems to satisfy the constant variance assumption: the variance of the errors stays constant and close to the line Residuals = 0 as the fitted values increase.
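Both diagnostic checks can be sketched numerically (the data here are simulated from a correctly specified linear model, so both assumptions hold by construction; our actual check was the visual residual plot):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([50.0, 3.0]) + rng.normal(0, 1.0, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Linearity: residuals centered on zero across the fitted values.
# Constant variance: spread does not grow with the fitted values,
# so the residual std in the lower and upper halves should be similar.
lo = fitted < np.median(fitted)
hi = ~lo
print(round(resid.mean(), 3), round(resid[lo].std() / resid[hi].std(), 2))

# The usual visual version of this check (what our plot shows):
# plt.scatter(fitted, resid); plt.axhline(0)
```

A mean residual near zero and a half-to-half spread ratio near 1 correspond to the flat, even band we saw in the plot.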

Because both assumptions are met, we did not need to do any further changes to the model.

Next, we wanted to check for influential points. We looked at outliers and high leverage points first.

We identified 2 outliers and 2 high leverage points, as you can see above. However, the outliers and the high leverage points were not the same. Therefore, there were no influential points. No points had to be removed and no further changes needed to be done to the model.
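The leverage and outlier computations can be sketched with numpy (synthetic data with one planted high-leverage x value and one planted y outlier which, as in our data, are different points; the 2p/n leverage cutoff and the |r| > 3 outlier cutoff are common conventions used here as assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.normal(size=n)
x[0] = 6.0                    # point 0: a high-leverage x value
y = 2 * x + rng.normal(size=n)
y[1] += 8.0                   # point 1: an outlier in y at an ordinary x

X = np.column_stack([np.ones(n), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix; diagonal = leverage
leverage = np.diag(H)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma = np.sqrt(resid @ resid / (n - 2))
standardized = resid / (sigma * np.sqrt(1 - leverage))

p = 2  # number of estimated coefficients
high_leverage = set(np.where(leverage > 2 * p / n)[0])
outliers = set(np.where(np.abs(standardized) > 3)[0])

# Influential points are those that are BOTH outliers and high-leverage.
print(sorted(high_leverage), sorted(outliers), sorted(high_leverage & outliers))
```

Because the planted outlier and the planted leverage point are different observations, their intersection is empty and nothing would need to be removed, which is exactly the situation we found.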

Lastly, we checked whether the model was overfitting. Using the randomized test dataset we created, we first compared the RMSE of the original model to the RMSE of our final model (the 5 variables plus the variable transformations).

Our RMSE decreased when we got rid of one variable and added the variable transformations. Though it only decreased by $226, it still decreased (which is what we want). Lastly, we calculated the RSE on the training data and compared it to the RMSE of our final model.

Our RSE on the training data was 10,868 and our RMSE on the test data was 11,201. The RSE on the training data is close to the RMSE on the test data, which shows that the performance of the model on unknown data is comparable to its performance on the known data. This implies that the model is NOT overfitting, which is what we wanted!
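The comparison can be sketched as follows (synthetic data; the point is the two formulas and the fact that train RSE and test RMSE land close together when the model generalizes):

```python
import numpy as np

def rse(resid, n, p):
    """Residual standard error on training data: sqrt(RSS / (n - p - 1))."""
    return np.sqrt(resid @ resid / (n - p - 1))

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(7)
n, p = 400, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = rng.normal(size=p + 1)
y = X @ beta_true + rng.normal(0, 1.0, n)

train, test = slice(0, 300), slice(300, 400)
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

train_rse = rse(y[train] - X[train] @ beta, 300, p)
test_rmse = rmse(y[test], X[test] @ beta)
print(round(train_rse, 3), round(test_rmse, 3))
# When these two numbers are close, the model is not overfitting.
```

A large gap (test RMSE far above train RSE) would have been the overfitting signature; our $10,868 vs $11,201 gap is small in relative terms.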

It is important to note that an RSE and RMSE above 10,000 dollars may seem very high. However, in terms of medianBaseSalary, an error of around $10,000 is not entirely bad: mid-career salaries are often near the six-figure range, so an error of $10,000 is not unreasonable.

Thus, the final regression formula that we deemed was most accurate for our model is below:

Things to note from this equation:

  1. Barron’s Selectivity is a decreasing metric in the sense that it ranges from 1 to 3, where 1 is ‘most competitive’ and 3 is ‘least competitive’. This makes sense given the negative coefficient associated with it, indicating that the more competitive the school (according to the scale), the higher the median base salary.
  2. The same can be said about undergradPop: the smaller the value, the higher the median base salary.
  3. outofstatetuition’s coefficient is exceptionally small, indicating that it is not a very influential factor.

9 Limitations of the model with regard to inference / prediction

The reliability and accuracy of the response variable, “medianBaseSalary,” lead to some limitations in our conclusions. There are no details about how this metric was recorded, nor about the point in time after graduation at which these statistics were recorded; a graduate’s base salary ten years after university will be drastically different from that of a graduate only two years out. Additionally, the statistic fails to account for job occupation: those who become lawyers will earn a different base salary from those who become baristas, for example. Likewise, the metric fails to consider the classification of the school and the potential impact it may have on median income. For instance, a small liberal arts institution may have different constraints regarding professions and compensation in comparison to a STEM/technology-focused institution. Consequently, a student may excel in their field at the former school, but their success may not be reflected in the median income. Thus, “medianBaseSalary” represents a very generalized statistic for each school with a large potential margin of error: this observation is reflected in the high RMSE value found within our model.

The “outofstatetuition” variable is also questionable in its accuracy for predicting median base salary. Depending on the type of school, the percentage of out-of-state undergraduate students may be much higher at some schools than at others, influencing how representative “outofstatetuition” is of median base salary.

The model will become statistically obsolete fairly quickly. Because employment salaries change rapidly with occupation trends and inflation, this model will not accurately reflect the salary of graduates only a few years out of college. For example, over just the last 10 years, given the ever-climbing importance of technology, the landscape of salaries and their relationships with one another has drastically changed. With the rise of AI and related technologies, it is impossible to predict the job and salary landscape even 5 years from now! Despite this, we believe that the inferred trends and conclusions based on the observations of the predictors are likely to persist.

10 Conclusions and Recommendations to stakeholder(s)

We drew several conclusions from our final inference model. Collinearity was tested using VIF, predictors were tested for interaction terms, and the best group of predictors was found, leading to a subset that minimized the RMSE and maximized the R² of the model. All predictors were statistically significant – their p-values were less than 0.05. The model was tested against a test dataset, and the RSE of the training data and the RMSE of the test data were calculated and compared: the model was found not to overfit the data, since the RSE on the training data is close to the RMSE on the test data. The coefficients of each predictor yielded the most significant conclusions. The “selectivity” and “undergraduate population” predictors have the largest effect on median base salary, driving higher salaries for university graduates. The intercept is inferentially insignificant because the “selectivity” predictor cannot take on a value of 0. The “rank” predictor surprisingly has the smallest effect on median base salary.

Stakeholders can greatly benefit from the model and conclusions above. For university administration, it is important to identify “selectivity” and “undergradPop” as the most influential predictors: minimizing both values – being very selective and maintaining a low undergraduate population – will be more effective in yielding a higher median base salary for graduates than minimizing or maximizing other predictors. Board members can use this information to adjust university admissions to sustain an academic environment conducive to preparing students for high-paying careers. As for parents and students, the model and its results confirm the expectation that attending top-tier schools affects future financial success. These schools, such as Northwestern, minimize “selectivity”, “undergradPop”, “studentFacultyRatio”, and “rank”, and maximize “outofstatetuition.” An analysis of these factors can help inform college application decisions and expectations: ideally, if the money is available (since tuition will be high), students and parents should apply to those top schools. Finally, the model can help inform investment decisions made by the government and investors: if they see that a university’s graduates are achieving high mid-career salaries, they may be more likely to invest in that institution. Following the previous observations, investing in top-tier colleges may be a safer option for those looking to make money. Conversely, the government may want to invest in schools that have higher “selectivity”, “undergradPop”, “studentFacultyRatio”, and “rank”, and lower “outofstatetuition.” Additionally, the model can be used as an overall metric of the possible return on investment.

As previously discussed in the limitations section of the report, the reliability and accuracy of the response variable, “medianBaseSalary,” and of the “outofstatetuition” variable lead to some limitations in our recommendations to stakeholders. The uncertainty in the measurement of median base salary itself – how and at what point it was recorded – combined with the variability in the student population paying out-of-state tuition limits the extent to which stakeholders can rely on the conclusive inferences; this variability and uncertainty is, however, statistically reflected and acknowledged in the RMSE.

GitHub and individual contribution

Link to github repository: https://github.com/nicketm/AshleysLostPhone

Individual contribution

Team member    | Contributed aspects                           | Details                                                                                                              | Number of GitHub commits
Ashley Witarsa | Data cleaning and EDA                         | Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations. | 27
Kaylee Mo      | Developing the model and checking assumptions | Checked for multicollinearity, linearity, and constant variance. Ran best subset selection to finalize the model.    | 11
Luca Moretti   | Outlier and influential points treatment      | Influential point, high leverage, and code fitting analysis of test data.                                            | 8
Nicket Mauskar | Test data and initial prediction analysis     | Manufactured synthetic test data and ran initial analysis of model effectiveness and fit.                            | 12


There is definitely a learning curve to using GitHub. Initially, we ran into difficulties when pushing and pulling, especially when different copies of the Jupyter notebook conflicted with each other. Though we have become more accustomed to GitHub over time, collaborating can still prove difficult. Waiting on other group members to make progress can be a hindrance, particularly when working in a similar area of the Jupyter notebook. In such cases, merge conflicts occurred when we attempted to push our work.

References


[1] P. Carnevale, J. Strohl, and M. Melton, “The Economic Value of College Majors.” Georgetown University Center on Education and the Workforce, May 2015. Available online at: https://cew.georgetown.edu/cew-reports/valueofcollegemajors/.